Python
Power BI
PostgreSQL
ETL
Data Modeling
Pandas
Snowflake Schema
OLAP
Project Overview
A comprehensive Business Intelligence solution for a UK-based online retail company, built to enable insightful analysis and informed decision-making. The project surfaces customer purchasing behavior, product performance, and sales trends across regions and time periods.
Business Needs: Help decide which country to expand brick-and-mortar operations into, and identify which products to market more effectively.
Goals & Deliverables
Goals:
- Data Warehouse Implementation: Create a robust data warehouse that facilitates efficient storage, retrieval, and analysis of transactional data
- Enhanced Decision-Making: Provide actionable insights to stakeholders by analyzing historical sales data
- Performance Monitoring: Track sales performance across various dimensions such as products, customers, and time
Deliverables:
- A comprehensive data model
- Implementation of an ETL process to populate the data warehouse
- Interactive dashboards using Power BI for visualization
Implementation Phases
1. Data Collection & Preprocessing
- Collected raw transactional data from two Excel sources, covering invoices, product details, and customer information
- Conducted data cleaning by removing inconsistencies, handling missing values, and standardizing formats
- Addressed missing Customer ID values by filling them with unique identifiers
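The missing Customer ID step above can be sketched in pandas. This is a minimal illustration with made-up rows, assuming surrogate IDs are generated above the existing maximum so no real customer ID is reused:

```python
import pandas as pd

# Sample transactions; NaN marks invoices with no customer attached (illustrative data)
df = pd.DataFrame({
    "Invoice": ["536365", "536366", "536367"],
    "Customer ID": [17850.0, None, None],
})

# Assign a fresh surrogate ID to each row that lacks one,
# starting above the existing maximum ID
missing = df["Customer ID"].isna()
next_id = int(df["Customer ID"].max()) + 1
df.loc[missing, "Customer ID"] = list(range(next_id, next_id + missing.sum()))
df["Customer ID"] = df["Customer ID"].astype(int)
```

After this step every transaction carries a unique, non-null identifier, which keeps the customer dimension complete.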
2. ETL Process Implementation
- Extraction: Extracted data from Online_Retail and Customer Excel files
- Transformation: Data type transformation, splitting date columns, data derivation (calculating total revenue)
- Loading: Manual loading to PostgreSQL via CSV files
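The load step was manual; the export side can be sketched as below. File and table names are assumptions for illustration, and the `\copy` command shown in the comment is the standard psql way to bulk-load a CSV:

```python
import pandas as pd

# Illustrative transformed frame, ready for loading into the warehouse
order_fact = pd.DataFrame({
    "invoice": ["536365"],
    "quantity": [6],
    "total_price": [15.30],
})

# Export to CSV for manual import into PostgreSQL, e.g. in psql:
#   \copy order_fact FROM 'order_fact.csv' WITH (FORMAT csv, HEADER true)
order_fact.to_csv("order_fact.csv", index=False)
```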
3. Data Modeling
- Designed a Snowflake schema with central fact table (order_fact)
- Dimension tables: customer_dim, customer_address_dim, time_dim, invoice_info_dim, product_dim, product_category_dim, product_info_dim
- Identified key relationships for efficient querying and reporting
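The snowflake aspect — normalizing shared attributes out of a dimension — can be sketched with the product tables. Column names and sample rows here are illustrative, not the project's actual data:

```python
import pandas as pd

# Denormalized product records (illustrative)
products = pd.DataFrame({
    "StockCode": ["85123A", "71053", "84406B"],
    "Description": ["WHITE HANGING HEART", "WHITE METAL LANTERN", "CREAM CUPID HEARTS"],
    "Category": ["Home Decor", "Home Decor", "Home Decor"],
})

# product_category_dim: one row per category with a surrogate key
product_category_dim = products[["Category"]].drop_duplicates().reset_index(drop=True)
product_category_dim["category_id"] = product_category_dim.index + 1

# product_dim keeps a foreign key to the category table instead of repeating the text
product_dim = products.merge(product_category_dim, on="Category").drop(columns="Category")
```

This is what distinguishes a snowflake from a star schema: dimension attributes with their own repetition (category, address) are split into sub-dimension tables linked by keys.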
Snowflake Schema Diagram
Data warehouse schema showing fact and dimension table relationships
4. OLAP Implementation
- Implemented using Power BI with data sourced from PostgreSQL
- Used Import Mode following MOLAP (Multidimensional OLAP) model
- Preloaded data into Power BI's in-memory engine for optimal performance
5. Dashboard Development
- Built Power BI dashboards to visualize key performance indicators
- Implemented drill-down capabilities for time analysis (year → month → day)
- Added filtering capabilities for gender and country analysis
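The country and gender slices behind these dashboards correspond to simple aggregations over the fact table joined with customer attributes. A pandas sketch with illustrative data and column names:

```python
import pandas as pd

# Fact rows joined with customer attributes (illustrative)
sales = pd.DataFrame({
    "Country": ["UK", "UK", "France"],
    "gender": ["F", "M", "F"],
    "Total_Price": [15.30, 22.00, 11.10],
})

# Revenue by country and gender, the kind of slice the dashboard filters expose
kpi = sales.groupby(["Country", "gender"])["Total_Price"].sum().reset_index()
```

In Power BI the equivalent slice is produced by the visual's filters and the model's relationships rather than explicit code, but the underlying aggregation is the same.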
Technical Implementation
The ETL process was implemented using Python in Google Colab, handling data transformation and preparation:
# Key data processing steps (excerpt)
import pandas as pd
import numpy as np

# Extraction: read both source workbooks
customer = pd.read_excel('customers.xlsx')
online_retail = pd.read_excel('onlineretail.xlsx')

# Cleaning and transformation:
# - handle missing values and remove duplicates
# - convert data types and standardize formats
# - derive columns (total revenue, profit calculations)

# Data modeling: build dimension tables from the merged sources
customer_dimension = joined_data[['Customer ID', 'Country', 'firstName', 'lastName', 'gender']].drop_duplicates()
product_dimension = joined_data[['StockCode', 'Description', 'Price']].drop_duplicates()
time_dimension = pd.DataFrame(index=pd.to_datetime(invoice_dates))

# Fact table: business metrics derived per order line
order_fact['Total_Price'] = order_fact['Quantity'] * order_fact['Selling Price']
order_fact['Total_Profit'] = order_fact['Quantity'] * order_fact['Unit Profit']
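The date-splitting transformation mentioned in the ETL phase, which feeds the year → month → day drill-down, can be sketched as follows (timestamps and column names are illustrative):

```python
import pandas as pd

# Illustrative invoice timestamps
time_dim = pd.DataFrame({
    "InvoiceDate": pd.to_datetime(["2011-12-01 08:26", "2011-12-05 13:45"])
})

# Split into the hierarchy levels used for drill-down (year -> month -> day)
time_dim["year"] = time_dim["InvoiceDate"].dt.year
time_dim["month"] = time_dim["InvoiceDate"].dt.month
time_dim["day"] = time_dim["InvoiceDate"].dt.day
```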
Results & Impact
- Successfully processed 541,909 transactions from the UK-based online retailer
- Analyzed data from 4,372 unique customers across multiple dimensions
- Applied data quality assurance throughout, with systematic handling of missing values
- Created comprehensive product categorization system with 15+ categories
- Enabled interactive business intelligence through Power BI dashboards
Challenges & Future Enhancements
Challenges Addressed:
- Data Quality Issues: Extensive preprocessing required for missing and inconsistent data
- Customer ID Missing Values: Solved by filling with unique identifiers
- Data Integration: Successfully merged multiple data sources with proper mapping
Possible Enhancements:
- Automating the ETL process for better scheduling and monitoring
- Implementing HOLAP (Hybrid OLAP) for balance between storage efficiency and query performance
- Enhancing dashboards with predictive analytics using machine learning models
- Expanding data warehouse to include social media and customer feedback data for sentiment analysis